refactor: Support batch export models as views #23052

tomasfarias · 2024-06-18T15:02:49Z

Problem

This PR lays the groundwork for person batch exports by introducing a set of parametrized views that will serve as the basis of batch exports moving forward. This new set of views includes a view for the persons model. Compared with the old approach, parametrized views make the query logic easier to read because it is laid out in plain SQL (ideally these would be SQL files, but our migration system works with Python files...).

Changes

Introduce new parametrized views for person and events batch exports.
- The person model is completely new, so there is no impact on existing functionality. This PR only introduces the view of the model; a follow-up PR will make use of it.
- On the other hand, the event model is changing slightly with the new view:
  - We are now using GROUP BY instead of DISTINCT. This should perform better and use less memory (in particular with optimize_aggregation_in_order enabled).
  - We are now using a PREWHERE clause for filtering first on inserted_at. Once again, this should result in better performance.
  - The event model is divided into three variants: "default", "unbounded", and "backfill", with one view for each. The variant is determined by how the query uses the timestamp and inserted_at filters. Each should behave the same as before, but is laid out in plain SQL instead of composed via Python formatting.
  - For legacy reasons, the event model differs by destination. Eventually, I'll create views for each destination to lay them out in SQL as well.
- Using the views required refactoring large portions of code.
Introduce new methods to iterate over records asynchronously.
- This requires implementing parsing of pyarrow.RecordBatch.
- We will initially only use these to iterate over records of new models (i.e. persons), and switch events over at a later date.
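
For readers unfamiliar with ClickHouse parametrized views, here is a minimal sketch of the idea. The view name `example_events_view`, its column list, and the helper `render_view_call` are hypothetical and not taken from this PR; only the `{name:Type}` placeholder syntax and the `view(param=value)` call form are ClickHouse features. The sketch also shows GROUP BY over all selected columns standing in for DISTINCT, and a PREWHERE filter on inserted_at.

```python
import datetime as dt
from typing import Union

# Hypothetical view definition, for illustration only: the name and columns
# are assumptions, not the views added by this PR.
CREATE_EXAMPLE_EVENTS_VIEW = """
CREATE OR REPLACE VIEW example_events_view AS
SELECT
    team_id,
    uuid,
    event,
    timestamp,
    inserted_at
FROM events
PREWHERE
    events.inserted_at >= {interval_start:DateTime64}
    AND events.inserted_at < {interval_end:DateTime64}
WHERE
    team_id = {team_id:Int64}
GROUP BY
    team_id, uuid, event, timestamp, inserted_at
"""


def render_view_call(view_name: str, **parameters: Union[int, dt.datetime]) -> str:
    """Render `SELECT * FROM view(name=value, ...)` for int and datetime values.

    Hypothetical helper: a real client would likely pass values as query
    parameters rather than rendering literals into the query text.
    """
    rendered = []
    for name, value in parameters.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, dt.datetime)):
            raise TypeError(f"Unsupported type for parameter '{name}'")
        if isinstance(value, dt.datetime):
            literal = "'" + value.strftime("%Y-%m-%d %H:%M:%S") + "'"
        else:
            literal = str(value)
        rendered.append(f"{name}={literal}")
    return f"SELECT * FROM {view_name}({', '.join(rendered)})"
```

Because the view body is plain SQL, the filtering and deduplication logic can be read in one place, while callers only supply the parameter values.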

Does this work well for both Cloud and self-hosted?

How did you test this code?

Updated a few tests to work with views.
Ran all unit tests.

tomasfarias · 2024-06-18T15:04:09Z

posthog/temporal/tests/batch_exports/conftest.py

+@pytest_asyncio.fixture(scope="module", autouse=True)
+async def create_batch_export_views(clickhouse_client, django_db_setup):


Apparently ClickHouse migrations don't really run in tests.

tomasfarias · 2024-06-18T15:04:22Z

posthog/api/test/test_app_metrics.py

@@ -1,6 +1,5 @@
 import datetime as dt
 import json
-import uuid


Every change in this file is just clean-up.

tomasfarias · 2024-06-18T15:04:54Z

posthog/temporal/common/clickhouse.py

@@ -357,6 +359,23 @@ def stream_query_as_arrow(
            with pa.ipc.open_stream(pa.PythonFile(response.raw)) as reader:
                yield from reader

+    async def astream_query_as_arrow(


Not in use yet, will use for person model exports.

tomasfarias · 2024-06-18T15:05:29Z

posthog/temporal/batch_exports/batch_exports.py

+        # without battle testing it first.
+        # There are already changes going out to the queries themselves that will impact events in a
+        # positive way. So, we can come back later and drop this block.
+        for record_batch in iter_records(client, team_id=team_id, is_backfill=is_backfill, **parameters):


See TODO above: For now, keeping old functionality intact for safety. Will release person model first, test, and then switch events.
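
The control flow described here can be sketched roughly as follows, using only the standard library. All names (`legacy_iter_records`, `view_iter_records`, `iter_model_records`) and the integer "batches" are hypothetical stand-ins, not the functions in this PR: the point is only that one async generator can keep the old synchronous path for events and use the new asynchronous path for everything else.

```python
import asyncio
from typing import AsyncIterator, Iterator, List


def legacy_iter_records() -> Iterator[List[int]]:
    """Stand-in for the existing synchronous events path."""
    yield [1, 2]
    yield [3]


async def view_iter_records() -> AsyncIterator[List[int]]:
    """Stand-in for the new asynchronous, view-backed path."""
    for batch in ([10], [20, 30]):
        await asyncio.sleep(0)  # simulate awaiting network I/O
        yield batch


async def iter_model_records(model: str) -> AsyncIterator[List[int]]:
    """Yield batches for a model, keeping events on the legacy path for now."""
    if model == "events":
        for batch in legacy_iter_records():
            yield batch
        return  # once events move to views, this whole block can be dropped

    async for batch in view_iter_records():
        yield batch


async def collect(model: str) -> List[List[int]]:
    """Drain the generator into a list, for demonstration."""
    return [batch async for batch in iter_model_records(model)]
```

Callers see the same async-iterator interface for both models, so switching events over later should only require deleting the early-return branch.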

tomasfarias · 2024-06-18T15:28:26Z

posthog/temporal/batch_exports/batch_exports.py

@@ -77,78 +77,20 @@ def get_timestamp_field(is_backfill: bool) -> str:
    return timestamp_field


-async def get_rows_count(


We don't count rows anymore, so this function is unused. Cleaning up.

fuziontech · 2024-06-19T13:01:33Z

posthog/batch_exports/sql.py

+    PREWHERE
+        COALESCE(events.inserted_at, events._timestamp) >= {interval_start:DateTime64}
+        AND COALESCE(events.inserted_at, events._timestamp) < {interval_end:DateTime64}


At this point, inserted_at should always be set for all batch exports. Only historical exports require _timestamp, but backfills have already been switched over to query based on timestamp, so they also do not need to check for inserted_at/_timestamp. Removing the coalesce and using only inserted_at reduces the size of the data CH has to fetch by half.
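
As a rough illustration of why the read size drops, the sketch below compares the filter from the diff above with a simplified form. The simplified clause is an assumption about what the change looks like, since the final SQL is not shown in this thread, and `referenced_event_columns` is a hypothetical helper that just lists which `events.` columns a clause touches.

```python
import re
from typing import Set

# Filter as shown in the diff under review: two columns are read per row.
PREWHERE_WITH_COALESCE = """
PREWHERE
    COALESCE(events.inserted_at, events._timestamp) >= {interval_start:DateTime64}
    AND COALESCE(events.inserted_at, events._timestamp) < {interval_end:DateTime64}
"""

# Assumed simplified form: only inserted_at is read.
PREWHERE_INSERTED_AT_ONLY = """
PREWHERE
    events.inserted_at >= {interval_start:DateTime64}
    AND events.inserted_at < {interval_end:DateTime64}
"""


def referenced_event_columns(sql: str) -> Set[str]:
    """Return the set of column names referenced as `events.<column>`."""
    return set(re.findall(r"events\.(\w+)", sql))
```

The first clause references both inserted_at and _timestamp, while the second references only inserted_at, which is where the "by half" estimate for the PREWHERE read comes from.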

fuziontech

This is great. I'm excited to use Views here...easy to iterate and low risk.

fuziontech · 2024-06-23T15:28:33Z

posthog/temporal/batch_exports/batch_exports.py

    )
+FORMAT ArrowStream


This is exciting.
https://clickhouse.com/docs/en/integrations/data-formats/arrow-avro-orc#arrow-data-streaming

fuziontech · 2024-06-23T15:29:11Z

posthog/temporal/batch_exports/batch_exports.py

+            yield record_batch
+        return
+
+    async for record_batch in client.astream_query_as_arrow(view, query_parameters=parameters):



tomasfarias requested a review from fuziontech as a code owner June 18, 2024 15:02

tomasfarias force-pushed the feat/add-schema-selector-to-batch-exports-ui branch from a2744da to 4f182c7 Compare June 18, 2024 15:22

Base automatically changed from feat/add-schema-selector-to-batch-exports-ui to master June 18, 2024 18:04

tomasfarias and others added 22 commits June 19, 2024 10:56

refactor: Update metrics to fetch counts at request time

b43e7af

fix: Move import to method

590f74f

fix: Add function

53c71b6

feat: Custom schemas for batch exports

ea1d33f

feat: Frontend support for model field

81e929c

fix: Clean-up

054a826

fix: Add missing migration

11b47d3

fix: Make new field nullable

106a725

Update UI snapshots for chromium (1)

8b19079

Update UI snapshots for chromium (1)

fdb729a

Update UI snapshots for chromium (1)

ae9870c

Update UI snapshots for chromium (1)

091e380

Update UI snapshots for chromium (1)

af52af9

Update UI snapshots for chromium (1)

46c93de

fix: Bump migration number

c8bc6a6

fix: Bump migration number

fae1b59

refactor: Update metrics to fetch counts at request time

85f5094

fix: Actually use include and exclude events

b2382e2

refactor: Switch to counting runs

bc6dd4e

refactor: Support batch export models as views

ce7d4df

fix: Merge conflict

55f6a5a

fix: Quality check fixes

8e306a4

tomasfarias and others added 15 commits June 19, 2024 10:56

feat: Custom schemas for batch exports

4001daa

feat: Frontend support for model field

0f3cbe3

fix: Clean-up

fde00d4

fix: Add missing migration

ee354cb

fix: Make new field nullable

ccab5f6

Update UI snapshots for chromium (1)

e7b5cb9

Update UI snapshots for chromium (1)

6da056a

Update UI snapshots for chromium (1)

a8b3594

Update UI snapshots for chromium (1)

724c208

Update UI snapshots for chromium (1)

6b19e95

Update UI snapshots for chromium (1)

618d0b8

fix: Bump migration number

536cbdb

fix: Clean-up unused code

0a9343e

chore: Clean-up unused function and tests

f921bb5

fix: Clean-up unused function

dc9535f

tomasfarias force-pushed the refactor/support-for-batch-export-model-views branch from ecb38cb to dc9535f Compare June 19, 2024 08:57

tomasfarias added 4 commits June 19, 2024 11:18

fix: HTTP Batch export default fields

ee3fcb8

fix: Remove test case on new column not present in base table

c4deddf

chore: Clean-up unused functions and queries

f2cad91

fix: Only run extra clickhouse queries in batch exports tests

86c86a2

fuziontech reviewed Jun 19, 2024

tomasfarias added 4 commits June 19, 2024 15:14

fix: Remove deprecated test

740188b

fix: Add person_id to person model and enforce ordering

525c682

refactor: Also add version column

2eabe0c

tomasfarias force-pushed the refactor/support-for-batch-export-model-views branch from 6769736 to 2eabe0c Compare June 21, 2024 15:27

fuziontech approved these changes Jun 23, 2024

tomasfarias merged commit ca0bf0b into master Jun 24, 2024
85 checks passed

tomasfarias deleted the refactor/support-for-batch-export-model-views branch June 24, 2024 15:09
